About the project

31.10.2019

“This is a Massive Open Online Course (MOOC) meaning that everything you need to complete the course in terms of materials and exercises will be freely available online.”

When I first started to think about learning online, I realized that it is a good opportunity for me because, like most of us, I have very limited time for learning new skills. The MOOC concept is equal for everybody and benefits those of us who, for whatever reason, cannot come to a traditional classroom setting. There is so much I wish to learn about using R and data analytics. Fortunately, I have fairly good basic knowledge of biostatistics, but only very basic skills in R.

After the first exercise, I found that online learning seems to take at least as much time as traditional classroom learning, but you can decide when to put in the effort. I heard about this course from the coordinator of UEF's Doctoral Programme in Clinical Research and from Prof. Reijo Sund.

You can find my GitHub repository here

Br,

Juuso


Regression and model validation

The theme for week 2 was regression analysis. The week 2 exercises consist of 1) data wrangling exercises and 2) data analysis exercises. You can find the results of my second week below.

# read the data into memory 
std14 <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt", sep=",", header=TRUE)

The dataset consists of 7 variables (gender (factor), age (int), attitude (num), deep (num), stra (num), surf (num), and points (int)) and 166 observations. I excluded the observations where the exam points were 0. The variable names, short descriptions, and some basic characteristics of the data are shown below:

#Explore structure and dimensions of the dataset
str(std14)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...
dim(std14)
## [1] 166   7
summary(std14)
##  gender       age           attitude          deep            stra      
##  F:110   Min.   :17.00   Min.   :1.400   Min.   :1.583   Min.   :1.250  
##  M: 56   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333   1st Qu.:2.625  
##          Median :22.00   Median :3.200   Median :3.667   Median :3.188  
##          Mean   :25.51   Mean   :3.143   Mean   :3.680   Mean   :3.121  
##          3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083   3rd Qu.:3.625  
##          Max.   :55.00   Max.   :5.000   Max.   :4.917   Max.   :5.000  
##       surf           points     
##  Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.417   1st Qu.:19.00  
##  Median :2.833   Median :23.00  
##  Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :4.333   Max.   :33.00

According to the graphical overview, the age distribution is skewed and gender is unbalanced (110 F, 56 M), but all the other variables are fairly normally distributed.

# Access the libraries tidyr, dplyr, ggplot2 and corrplot
library(tidyr); library(dplyr); library(ggplot2); library(corrplot)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## corrplot 0.84 loaded
glimpse(std14)
## Observations: 166
## Variables: 7
## $ gender   <fct> F, M, F, M, M, F, M, F, M, F, M, F, F, F, M, F, F, F, M, F...
## $ age      <int> 53, 55, 49, 53, 49, 38, 50, 37, 37, 42, 37, 34, 34, 34, 35...
## $ attitude <dbl> 3.7, 3.1, 2.5, 3.5, 3.7, 3.8, 3.5, 2.9, 3.8, 2.1, 3.9, 3.8...
## $ deep     <dbl> 3.583333, 2.916667, 3.500000, 3.500000, 3.666667, 4.750000...
## $ stra     <dbl> 3.375, 2.750, 3.625, 3.125, 3.625, 3.625, 2.250, 4.000, 4....
## $ surf     <dbl> 2.583333, 3.166667, 2.250000, 2.250000, 2.833333, 2.416667...
## $ points   <int> 25, 12, 24, 10, 22, 21, 21, 31, 24, 26, 31, 31, 23, 25, 21...
gather(std14) %>% glimpse
## Warning: attributes are not identical across measure variables;
## they will be dropped
## Observations: 1,162
## Variables: 2
## $ key   <chr> "gender", "gender", "gender", "gender", "gender", "gender", "...
## $ value <chr> "F", "M", "F", "M", "M", "F", "M", "F", "M", "F", "M", "F", "...
# draw a bar plot of each variable and add frequency count labels above the bars
gather(std14) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()+ geom_text(stat='count', aes(label=..count..), vjust=-1)
## Warning: attributes are not identical across measure variables;
## they will be dropped

My aim was to find out the relationship between the exam points and attitude, age, and gender. In practice, that means how attitude, age, and gender are associated with the achieved exam points in this population. First, I made a correlation matrix (see below). Correlation analysis tells us about the association, or the absence of an association, between two variables ‘x’ and ‘y’.

A correlation matrix is a table showing correlation coefficients between variables. Each cell in the table shows the correlation between two variables. A positive correlation means a direct association between the two variables, and a negative correlation means an inverse association. Focusing on my main aim, we find a positive correlation between points and gender (R=0.093) and between points and attitude (R=0.437), and a negative correlation between points and age (R=-0.093).
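
As a small illustration of what a single cell of such a matrix contains, the sketch below computes a Pearson correlation coefficient from its definition and compares it with the built-in `cor()`. The data are simulated for this example and are not the course dataset; the variable names are hypothetical.

```r
# Simulated data (an assumption for illustration, not learning2014)
set.seed(1)
x <- rnorm(100)
y <- 0.5 * x + rnorm(100)

# Pearson correlation from its definition:
# co-variation of x and y divided by the product of their spreads
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

# Built-in equivalent
r_builtin <- cor(x, y)

# Both approaches should agree
all.equal(r_manual, r_builtin)
```

The coefficient always lies between -1 and 1, which is why the diagonal of a correlation matrix (each variable against itself) is exactly 1.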

# convert gender as integer
std14$gender <- as.integer(std14$gender)

# calculate the correlation matrix and round it
cor.matrix <- cor(std14)
head(round(cor.matrix,2))
##          gender   age attitude  deep  stra  surf points
## gender     1.00  0.12     0.29  0.06 -0.15 -0.11   0.09
## age        0.12  1.00     0.02  0.03  0.10 -0.14  -0.09
## attitude   0.29  0.02     1.00  0.11  0.06 -0.18   0.44
## deep       0.06  0.03     0.11  1.00  0.10 -0.32  -0.01
## stra      -0.15  0.10     0.06  0.10  1.00 -0.16   0.15
## surf      -0.11 -0.14    -0.18 -0.32 -0.16  1.00  -0.14
cor.matrix
##               gender         age    attitude        deep        stra       surf
## gender    1.00000000  0.11901733  0.29423035  0.05809597 -0.14552789 -0.1126999
## age       0.11901733  1.00000000  0.02220071  0.02507804  0.10244409 -0.1414052
## attitude  0.29423035  0.02220071  1.00000000  0.11024302  0.06174177 -0.1755422
## deep      0.05809597  0.02507804  0.11024302  1.00000000  0.09650255 -0.3238020
## stra     -0.14552789  0.10244409  0.06174177  0.09650255  1.00000000 -0.1609729
## surf     -0.11269987 -0.14140516 -0.17554218 -0.32380198 -0.16097287  1.0000000
## points    0.09290782 -0.09319032  0.43652453 -0.01014849  0.14612247 -0.1443564
##               points
## gender    0.09290782
## age      -0.09319032
## attitude  0.43652453
## deep     -0.01014849
## stra      0.14612247
## surf     -0.14435642
## points    1.00000000
# visualize the correlation matrix
corrplot(cor.matrix, method = "number")

After the correlation analysis, I ran a regression analysis. Regression analysis predicts the value of the dependent variable from the known values of the independent variables, assuming an average mathematical relationship between two or more variables.

# create a regression model with multiple explanatory variables
my_model1 <- lm(points ~ attitude + age + gender, data = std14)

# print out a summary of the model
summary(my_model1)
## 
## Call:
## lm(formula = points ~ attitude + age + gender, data = std14)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.4590  -3.3221   0.2186   4.0247  10.4632 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.75963    2.31478   5.944 1.65e-08 ***
## attitude     3.60657    0.59322   6.080 8.34e-09 ***
## age         -0.07586    0.05367  -1.414    0.159    
## gender      -0.33054    0.91934  -0.360    0.720    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.315 on 162 degrees of freedom
## Multiple R-squared:  0.2018, Adjusted R-squared:  0.187 
## F-statistic: 13.65 on 3 and 162 DF,  p-value: 5.536e-08
# draw diagnostic plots using the plot() function. Choose the plots Residuals vs Fitted values = 1, Normal QQ-plot = 2 and Residuals vs Leverage = 5
par(mfrow = c(2,2))
plot(my_model1, which = c(1,2,5))

Let’s explain the analysis output step by step.

Formula Call

As you can see, the first item shown in the output is the formula R used to fit the data. Note the simplicity in the syntax: the formula just needs the predictors (attitude, age, gender) and the target/response variable (points), together with the data being used (std14).

Residuals

The next item in the model output talks about the residuals. Residuals are essentially the difference between the actual observed response values and the response values that the model predicted. The Residuals section of the model output breaks them down into 5 summary points. When assessing how well the model fits the data, you should look for a distribution that is roughly symmetrical around zero across these points.
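
A minimal sketch of this idea, using simulated data rather than the course dataset (all names here are hypothetical): the residuals are computed by hand as observed minus fitted values, and their quantiles give the five summary points.

```r
# Simulated data (an assumption for illustration)
set.seed(2)
d <- data.frame(x = rnorm(50))
d$y <- 2 + 3 * d$x + rnorm(50)

fit <- lm(y ~ x, data = d)

# Residual = observed response - response predicted by the model
res_manual <- d$y - fitted(fit)

# Same values as the residuals stored in the model object
all.equal(unname(res_manual), unname(residuals(fit)))

# Min, 1Q, Median, 3Q, Max: the five points shown in the Residuals section
quantile(res_manual)

# For a model with an intercept the residuals average out to zero
mean(res_manual)
```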

Coefficients

The next section in the model output talks about the coefficients of the model.

Coefficient - Estimate

The Estimate column contains one row per model term. The first row is the intercept: the expected exam points when all predictors are zero, that is, the point where the fitted function crosses the y-axis. The remaining rows are the slopes of attitude, age and gender. The attitude slope says that for every one-unit increase in attitude, exam points go up by about 3.6, holding age and gender constant.
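
To make the meaning of the estimates concrete, here is a small sketch that plugs the coefficients reported in the summary output into the regression equation. The helper `predict_points()` is a hypothetical function written for this example; the predictor values are made up.

```r
# Coefficients copied from the summary(my_model1) output
b <- c(intercept = 13.75963, attitude = 3.60657,
       age = -0.07586, gender = -0.33054)

# Hypothetical helper: fitted exam points for given predictor values
# (gender is coded 1 = F, 2 = M after as.integer())
predict_points <- function(attitude, age, gender) {
  unname(b["intercept"] + b["attitude"] * attitude +
           b["age"] * age + b["gender"] * gender)
}

# Two made-up students who differ only by one unit of attitude
p1 <- predict_points(attitude = 3.0, age = 25, gender = 1)
p2 <- predict_points(attitude = 4.0, age = 25, gender = 1)

# The difference equals the attitude slope (about 3.61 points)
p2 - p1
```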

Coefficient - Standard Error

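The coefficient Standard Error measures how much the coefficient estimate would vary, on average, if the model were fitted to repeated samples from the same population. The smaller it is relative to the estimate, the more precisely the coefficient has been estimated.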

Coefficient - t value

The coefficient t-value measures how many standard errors our coefficient estimate is away from 0. We want it to be far from zero, as this would indicate that we can reject the null hypothesis, that is, we could declare a relationship between attitude and exam points.
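
The t-values can be reproduced directly from the two preceding columns. The sketch below uses the estimates and standard errors copied from the summary output; small differences in the last digit are possible because the inputs are already rounded.

```r
# Estimates and standard errors copied from the summary output
est <- c(attitude = 3.60657, age = -0.07586, gender = -0.33054)
se  <- c(attitude = 0.59322, age = 0.05367, gender = 0.91934)

# t value = estimate divided by its standard error
t_values <- est / se
round(t_values, 2)
```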

Coefficient - Pr(>|t|)

The Pr(>|t|) column in the model output is the probability of observing a t-value at least as extreme as the one obtained, in absolute value, if the true coefficient were zero. A small p-value indicates that the observed relationship between a predictor and the response (exam points) is unlikely to be due to chance alone. Typically, a p-value of 5% or less is used as the cut-off. In our model, the p-values for the intercept and attitude are very close to zero, whereas those for age (0.159) and gender (0.720) are well above 0.05. Note the ‘Signif. codes’ associated with each estimate: three stars (or asterisks) represent a highly significant p-value. Consequently, the small p-value for the attitude slope means that we can reject the null hypothesis and conclude that there is a relationship between attitude and exam points, while the model gives no evidence of a relationship for age or gender.
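
As a rough check, the two-sided p-values can be recomputed from the t-values and the residual degrees of freedom (162) printed in the summary output. This is a sketch based on the rounded values shown there, so the results only match to about three significant digits.

```r
# t values and residual degrees of freedom copied from the summary output
t_values <- c(attitude = 6.080, age = -1.414, gender = -0.360)
df_resid <- 162

# Two-sided p-value: probability of a t statistic at least this extreme
# in either direction under the null hypothesis
p_values <- 2 * pt(abs(t_values), df = df_resid, lower.tail = FALSE)
signif(p_values, 3)
```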

Residual Standard Error

Residual Standard Error is a measure of the quality of a linear regression fit. Theoretically, every linear model is assumed to contain an error term E. Because of this error term, we cannot perfectly predict our response variable (exam points) from the predictors (attitude, age and gender). The Residual Standard Error is the average amount by which the response (exam points) deviates from the true regression line. In our example, the actual exam points deviate from the regression line by approximately 5.315 points, on average.
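
The sketch below shows how this number is obtained: the square root of the residual sum of squares divided by the residual degrees of freedom (in the model above, 166 observations minus 4 estimated coefficients gives 162). The data are simulated for illustration and the names are hypothetical.

```r
# Simulated data (an assumption for illustration)
set.seed(3)
d <- data.frame(x1 = rnorm(60), x2 = rnorm(60))
d$y <- 1 + 2 * d$x1 - d$x2 + rnorm(60, sd = 2)

fit <- lm(y ~ x1 + x2, data = d)

# Residual sum of squares and residual degrees of freedom (n - coefficients)
rss <- sum(residuals(fit)^2)
df_resid <- df.residual(fit)

# Residual standard error computed by hand
rse_manual <- sqrt(rss / df_resid)

# Same value as reported by summary()
all.equal(rse_manual, summary(fit)$sigma)
```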

Multiple R-squared, Adjusted R-squared

The R-squared (R2) statistic provides a measure of how well the model fits the actual data. It takes the form of a proportion of variance. R2 is a measure of the linear relationship between our predictor variables (attitude, age and gender) and our response / target variable (exam points). It always lies between 0 and 1: a number near 0 represents a regression that does not explain the variance in the response variable well, and a number close to 1 represents one that does. In our example, the R2 is 0.2018, so roughly 20% of the variance in the response variable (exam points) can be explained by the predictor variables (attitude, age and gender). The Adjusted R-squared (0.187) penalizes R2 for the number of predictors in the model.
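
The relation between the two R-squared values can be checked with the numbers from the summary output. This sketch applies the standard adjustment formula with n = 166 observations and p = 3 predictors.

```r
# Values copied from the summary output
r2 <- 0.2018   # Multiple R-squared
n  <- 166      # number of observations
p  <- 3        # number of predictors (attitude, age, gender)

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 3)
```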

F-Statistic

The F-statistic is a good indicator of whether there is a relationship between our predictors and the response variable. The further the F-statistic is from 1, the better. However, how much larger than 1 the F-statistic needs to be depends on both the number of data points and the number of predictors. Generally, when the number of data points is large, an F-statistic that is only a little larger than 1 is already sufficient to reject the null hypothesis (H0: there is no relationship between attitude + age + gender and exam points). The reverse is also true: if the number of data points is small, a large F-statistic is required to ascertain that there may be a relationship between the predictor and response variables. In our example the F-statistic is 13.65, which is clearly larger than 1 given the size of our data (166 observations).
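
The F-statistic and its p-value can also be recovered from R-squared and the degrees of freedom in the summary output. The sketch below uses the rounded values printed there, so the result agrees only up to rounding.

```r
# Values copied from the summary output
r2 <- 0.2018   # Multiple R-squared
n  <- 166      # number of observations
p  <- 3        # number of predictors

# F = explained variance per predictor / unexplained variance per residual df
f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))
round(f_stat, 2)

# Upper-tail probability of the F distribution with 3 and 162 df
p_value <- pf(f_stat, df1 = p, df2 = n - p - 1, lower.tail = FALSE)
signif(p_value, 3)
```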

Finally, I checked the validity of the model assumptions graphically. For that I produced the following diagnostic plots: Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage. Let’s begin with the Residuals vs Fitted plot. For a linear model fitted to data that satisfy the standard assumptions of linear regression, the residuals scatter randomly around zero. Here the scatterplot shows a good setup for a linear regression: the data appear to be well modeled by a linear relationship between y and x, and the points appear to be randomly spread out about the line, with no discernible non-linear trends or changes in variability.

The Normal QQ plot helps us to assess whether the residuals are roughly normally distributed. In this case the residuals follow the diagonal line quite well, which means that they are approximately normally distributed (another assumption of the model).

Outliers and the Residuals vs Leverage plot. There’s no single accepted definition of what constitutes an outlier. This plot has the typical look when there are no influential cases: we cannot even see the Cook’s distance lines (red dashed lines) because all cases are well inside them.
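
The same check can be done numerically. The following sketch, on simulated data with hypothetical names, extracts leverage and Cook's distance from a fitted model and flags cases above 0.5, the first contour level that `plot()` draws for `lm` objects; treating 0.5 as a cut-off is a rule of thumb assumed here, not a formal test.

```r
# Simulated data (an assumption for illustration)
set.seed(4)
d <- data.frame(x = rnorm(40))
d$y <- 1 + 2 * d$x + rnorm(40)

fit <- lm(y ~ x, data = d)

# Leverage (hat values) and Cook's distance for every observation
lev  <- hatvalues(fit)
cook <- cooks.distance(fit)

# Rule-of-thumb flag (assumption): Cook's distance above 0.5
influential <- which(cook > 0.5)
length(influential)

# Cook's distance combines residual size and leverage
p  <- length(coef(fit))
s2 <- summary(fit)$sigma^2
cook_manual <- residuals(fit)^2 / (p * s2) * lev / (1 - lev)^2
all.equal(unname(cook_manual), unname(cook))
```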


Logistic regression

Data Set Information:

These data address student achievement in secondary education at two Portuguese schools. The data attributes include student grades and demographic, social and school-related features, and they were collected by using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades. It is more difficult to predict G3 without G2 and G1, but such prediction is much more useful (see paper source for more details).

Source:

Paulo Cortez, University of Minho, Guimarães, Portugal, http://www3.dsi.uminho.pt/pcortez

Relevant Papers:

P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS, ISBN 978-9077381-39-7.

Let’s start working!

# read the data into memory 
alc <- read.csv("C:/Users/juusov/Documents/IODS-project/Data/alc.csv", header = TRUE, sep = ",")
# print out the names of the variables in the data
names(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "nursery"    "internet"   "guardian"   "traveltime"
## [16] "studytime"  "failures"   "schoolsup"  "famsup"     "paid"      
## [21] "activities" "higher"     "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

Exploring the data

My aim is to find out how age, free time after school, current health status, and number of school absences are associated with high/low alcohol consumption among students. My hypothesis is that heavy drinkers (who are more frequently men than women) have more school absences and free time, are older, and have poorer perceived health. Let’s pick the variables we’re interested in and look at some basic statistics.

# access the tidyverse libraries dplyr, ggplot2, corrplot, and boot 
library(tidyr); library(dplyr); library(ggplot2); library(corrplot); library(boot)

# produce mean statistics by group
alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_age = mean(age), mean_free_time = mean(freetime), mean_health = mean(health), mean_absence = mean(absences))
## # A tibble: 4 x 7
## # Groups:   sex [2]
##   sex   high_use count mean_age mean_free_time mean_health mean_absence
##   <fct> <lgl>    <int>    <dbl>          <dbl>       <dbl>        <dbl>
## 1 F     FALSE      156     16.6           2.93        3.38         4.22
## 2 F     TRUE        42     16.5           3.36        3.40         6.79
## 3 M     FALSE      112     16.3           3.39        3.71         2.98
## 4 M     TRUE        72     17.0           3.5         3.88         6.12

Results are grouped by sex and high/low alcohol consumption. Among females there are 156 low/moderate drinkers and 42 heavy drinkers. Correspondingly, among males there are 112 low/moderate drinkers and 72 heavy drinkers. Fortunately, in both sexes there are more low/moderate drinkers than heavy drinkers. See the other details above.

Boxplots

# boxplots for the whole population
par(mfrow=c(1,5))
boxplot(alc$age, main="Age")
boxplot(alc$freetime, main="Freetime")
boxplot(alc$health, main=" Current Health Status")
boxplot(alc$absences, main="Number of School Absences")
boxplot(alc$alc_use, main="Alcohol using")

# boxplots by sex
par(mfrow=c(1,5))
boxplot(alc$age~alc$sex, main="Age")
boxplot(alc$freetime~alc$sex, main="Freetime")
boxplot(alc$health~alc$sex, main=" Current Health Status")
boxplot(alc$absences~alc$sex, main="Number of School Absences")
boxplot(alc$alc_use~alc$sex, main="Alcohol using")

# boxplots by alcohol high use
par(mfrow=c(1,4))
boxplot(alc$age~alc$high_use, main="Age")
boxplot(alc$freetime~alc$high_use, main="Freetime")
boxplot(alc$health~alc$high_use, main=" Current Health Status")
boxplot(alc$absences~alc$high_use, main="Number of School Absences")

# choose columns to keep for the analyses
keep_columns <- c("age", "sex", "freetime", "health", "absences", "alc_use", "high_use")

# select the 'alc_subset' to create a new dataset 
alc_subset <- dplyr::select(alc, one_of(keep_columns))

# draw a bar plot of each variable 
gather(alc_subset) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()

As we can see from the distribution plots and bar charts, only freetime looks roughly normally distributed, and the two sexes are fairly evenly represented. My hypothesis is partially supported. Males seem to use more alcohol than females. Heavy drinkers are older than moderate drinkers and have more school absences, but there are no clear differences in free time or current health status between heavy and moderate drinkers.

Logistic regression analyses

# model with glm
m <- glm(alc_subset$high_use ~ alc_subset$age + alc_subset$sex + alc_subset$freetime + alc_subset$health + alc_subset$absences, data = alc, family = "binomial")

#print out summary
summary(m)
## 
## Call:
## glm(formula = alc_subset$high_use ~ alc_subset$age + alc_subset$sex + 
##     alc_subset$freetime + alc_subset$health + alc_subset$absences, 
##     family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1098  -0.8203  -0.6121   1.0681   2.0876  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -5.94027    1.81093  -3.280 0.001037 ** 
## alc_subset$age       0.18163    0.10220   1.777 0.075542 .  
## alc_subset$sexM      0.86250    0.24770   3.482 0.000498 ***
## alc_subset$freetime  0.28776    0.12533   2.296 0.021677 *  
## alc_subset$health    0.05873    0.08800   0.667 0.504507    
## alc_subset$absences  0.09335    0.02301   4.058 4.95e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 420.99  on 376  degrees of freedom
## AIC: 432.99
## 
## Number of Fisher Scoring iterations: 4
# compute odds ratios (OR)
OR <- coef(m) %>% exp

# compute confidence intervals (CI)
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
##                              OR        2.5 %     97.5 %
## (Intercept)         0.002631321 7.021357e-05 0.08669783
## alc_subset$age      1.199171825 9.830119e-01 1.46888737
## alc_subset$sexM     2.369086597 1.464751e+00 3.87560256
## alc_subset$freetime 1.333435235 1.045768e+00 1.71125765
## alc_subset$health   1.060491092 8.937200e-01 1.26298969
## alc_subset$absences 1.097850460 1.051579e+00 1.15103254

“When a logistic regression is calculated, the regression coefficient (b1) is the estimated increase in the log odds of the outcome per unit increase in the value of the exposure. In other words, the exponential function of the regression coefficient (eb1) is the odds ratio associated with a one-unit increase in the exposure. An odds ratio (OR) is a measure of association between an exposure and an outcome. The OR represents the odds that an outcome will occur given a particular exposure, compared to the odds of the outcome occurring in the absence of that exposure.” (Szumilas M. Explaining odds ratios [published correction appears in J Can Acad Child Adolesc Psychiatry. 2015 Winter;24(1):58]. J Can Acad Child Adolesc Psychiatry. 2010;19(3):227–229.)
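
The conversion described in the quotation can be illustrated with the coefficients from the model summary. The sketch below exponentiates the reported log-odds coefficients; the choice of three extra absences in the last step is a made-up example.

```r
# Log-odds coefficients copied from the summary(m) output
b <- c(age = 0.18163, sexM = 0.86250, freetime = 0.28776,
       health = 0.05873, absences = 0.09335)

# Odds ratio = exp(coefficient): multiplicative change in the odds
# of high use per one-unit increase in the predictor
odds_ratio <- exp(b)
round(odds_ratio, 2)

# Percent change in the odds per one-unit increase
round(100 * (odds_ratio - 1))

# Effects multiply on the odds scale: three additional absences
or_three_absences <- exp(3 * b[["absences"]])
or_three_absences
```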

Results of logistic regression model

Let’s look at the coefficients first. In this model sex, freetime, and school absences are significantly associated with high alcohol use. Looking at the odds ratios (OR), we can conclude that being male multiplies the odds of high alcohol use by 2.37 (137% higher odds), each one-unit increase in free time by 1.33 (33% higher), and each additional school absence by 1.10 (10% higher). This analysis gets us closer to a final conclusion. The hypothesis is still partly alive: we can now say that sex, free time and school absences are statistically associated with higher alcohol consumption in this population.

Prediction and validation

Next we compare the predicted values with the real values to estimate how well our model predicts. In conclusion, the model accuracy is acceptable: about 27% of the observations in the data are misclassified.

#fit the model
m2 <- glm(high_use ~ sex + freetime + absences, data = alc_subset, family = "binomial")

# predict() the probability of high_use
probabilities <- predict(m2, type = "response")

# add the predicted probabilities to 'alc_subset'
alc_subset <- mutate(alc_subset, probability = probabilities)

# use the probabilities to make a prediction of high_use
alc_subset <- mutate(alc_subset, prediction = probability > 0.5)

# see the last twenty original classes, predicted probabilities, and class predictions
select(alc_subset, sex, freetime, absences, high_use, probability, prediction) %>% tail(20)
##     sex freetime absences high_use probability prediction
## 363   F        4        8    FALSE  0.30998649      FALSE
## 364   F        5        9    FALSE  0.39835071      FALSE
## 365   F        4        0    FALSE  0.17042678      FALSE
## 366   F        3        3    FALSE  0.17090391      FALSE
## 367   F        4        2     TRUE  0.19988715      FALSE
## 368   F        1        0    FALSE  0.07923997      FALSE
## 369   F        5       14     TRUE  0.51915876       TRUE
## 370   M        2        4     TRUE  0.28999668      FALSE
## 371   M        4        2    FALSE  0.37497539      FALSE
## 372   M        4        3    FALSE  0.39816238      FALSE
## 373   M        3        0    FALSE  0.26961553      FALSE
## 374   M        4        7     TRUE  0.49452118      FALSE
## 375   F        3        1    FALSE  0.14494141      FALSE
## 376   F        4        6    FALSE  0.26977031      FALSE
## 377   F        4        2    FALSE  0.19988715      FALSE
## 378   F        3        2    FALSE  0.15748815      FALSE
## 379   F        2        2    FALSE  0.12270339      FALSE
## 380   F        1        3    FALSE  0.10346444      FALSE
## 381   M        4        4     TRUE  0.42181554      FALSE
## 382   M        4        2     TRUE  0.37497539      FALSE
# initialize a plot of 'high_use' versus 'probability' in 'alc_subset'
g <- ggplot(alc_subset, aes(x = probability, y = high_use, col = prediction))

# define the geom as points and draw the plot
g + geom_point()

# tabulate the target variable versus the predictions
table(high_use = alc_subset$high_use, prediction = alc_subset$prediction)%>%prop.table()%>%addmargins()
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.66492147 0.03664921 0.70157068
##    TRUE  0.23036649 0.06806283 0.29842932
##    Sum   0.89528796 0.10471204 1.00000000
# define a loss function (average prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# call loss_func to compute the average number of wrong predictions in the data
loss_func(class = alc_subset$high_use, prob = alc_subset$probability)
## [1] 0.2670157
# K-fold cross-validation
cv <- cv.glm(data = alc_subset, cost = loss_func, glmfit = m, K = 10)

# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.3464749
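
`cv.glm()` refits the model on each training fold by re-evaluating the model call with a subset of `data`. The sketch below therefore writes the formula with plain column names, so that every refit uses only the rows of its own fold. It runs on simulated data; the data frame `sim` and its coefficients are assumptions made for illustration, and only `loss_func` follows the definition used above.

```r
library(boot)  # provides cv.glm()

# Simulated data (an assumption for illustration, not the alc data)
set.seed(5)
n <- 200
sim <- data.frame(absences = rpois(n, 4),
                  freetime = sample(1:5, n, replace = TRUE))
sim$high_use <- rbinom(n, 1,
                       plogis(-2 + 0.1 * sim$absences + 0.3 * sim$freetime)) == 1

# Formula uses column names only, so the model can be refitted on subsets
fit <- glm(high_use ~ absences + freetime, data = sim, family = "binomial")

# Proportion of wrong predictions at a 0.5 probability threshold
loss_func <- function(class, prob) mean(abs(class - prob) > 0.5)

# Training error: evaluated on the same data the model was fitted to
train_error <- loss_func(sim$high_use, predict(fit, type = "response"))

# 10-fold cross-validated error: each fold is predicted by a model
# that did not see it during fitting
cv <- cv.glm(data = sim, glmfit = fit, cost = loss_func, K = 10)
cv_error <- cv$delta[1]

c(training = train_error, cross_validated = cv_error)
```

The cross-validated error is usually the more honest estimate of how the model would perform on new students, because the training error is computed on data the model has already seen.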


Clustering and classification

# access the packages
library(MASS); library(corrplot); library(tidyr); library(dplyr); library(ggplot2)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
# load the data
data("Boston")

# explore the dataset
dim(Boston)
## [1] 506  14
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Data Set Information:

The “Boston {MASS}” dataset consists of housing values in suburbs of Boston. The Boston data frame has 506 rows and 14 columns.

This data frame contains the following variables:

crim per capita crime rate by town.

zn proportion of residential land zoned for lots over 25,000 sq.ft.

indus proportion of non-retail business acres per town.

chas Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).

nox nitrogen oxides concentration (parts per 10 million).

rm average number of rooms per dwelling.

age proportion of owner-occupied units built prior to 1940.

dis weighted mean of distances to five Boston employment centres.

rad index of accessibility to radial highways.

tax full-value property-tax rate per $10,000.

ptratio pupil-teacher ratio by town.

black 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.

lstat lower status of the population (percent).

medv median value of owner-occupied homes in $1000s.

# Change the shape of the data from wide-format to long-format
require(reshape2)
## Loading required package: reshape2
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
melt.boston <- melt(Boston)
## No id variables; using all as measure variables
head(melt.boston)
##   variable   value
## 1     crim 0.00632
## 2     crim 0.02731
## 3     crim 0.02729
## 4     crim 0.03237
## 5     crim 0.06905
## 6     crim 0.02985
# draw a density plot of each variable
ggplot(data = melt.boston, aes(x = value)) + stat_density() + facet_wrap(~variable, scales = "free")

# plot matrix of the Boston dataset variables
pairs(Boston)

# calculate the correlation matrix of the Boston dataset and round it
cor_matrix<-cor(Boston) 

# print the correlation matrix
cor_matrix %>% round(digits = 2)
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax ptratio
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58    0.29
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31   -0.39
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72    0.38
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04   -0.12
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67    0.19
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29   -0.36
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51    0.26
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53   -0.23
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91    0.46
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00    0.46
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46    1.00
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44   -0.18
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54    0.37
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47   -0.51
##         black lstat  medv
## crim    -0.39  0.46 -0.39
## zn       0.18 -0.41  0.36
## indus   -0.36  0.60 -0.48
## chas     0.05 -0.05  0.18
## nox     -0.38  0.59 -0.43
## rm       0.13 -0.61  0.70
## age     -0.27  0.60 -0.38
## dis      0.29 -0.50  0.25
## rad     -0.44  0.49 -0.38
## tax     -0.44  0.54 -0.47
## ptratio -0.18  0.37 -0.51
## black    1.00 -0.37  0.33
## lstat   -0.37  1.00 -0.74
## medv     0.33 -0.74  1.00
# visualize the correlation matrix of the dataset
corrplot(cor_matrix, method="number", type='upper', diag = FALSE)

Several of the variables are highly skewed, in particular crim, zn, chas, dis, and black; some of the others appear moderately skewed. The skewed distributions suggest that transforming some variables could improve their performance in the models. The correlation matrix also shows several highly correlated variables (for example rad and tax, r = 0.91). Highly correlated variables need care, because they can have an outsized influence on the models. The next step is to standardize the dataset and print out summaries of the scaled data, then create a categorical variable of the crime rate in the Boston dataset using the quantiles as the break points, drop the old crime rate variable from the dataset, and create training and testing data (80% of the data belongs to the train set).
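As a rough way to make "highly correlated" concrete, the pairs can be pulled out of the correlation matrix programmatically. This is only a sketch: the 0.7 cut-off and the names `threshold` and `high_pairs` are my own assumptions, not part of the course material.

```r
library(MASS)  # ships with R; provides the Boston data

data("Boston")
cor_matrix <- cor(Boston)

# assumed cut-off for calling a pair "highly correlated"
threshold <- 0.7

# look only at the upper triangle so that each pair is listed once
idx <- which(upper.tri(cor_matrix) & abs(cor_matrix) > threshold, arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(cor_matrix)[idx[, "row"]],
  var2 = colnames(cor_matrix)[idx[, "col"]],
  r    = round(cor_matrix[idx], 2)
)

# strongest correlations first
high_pairs <- high_pairs[order(-abs(high_pairs$r)), ]
high_pairs
```

The rad and tax pair should come out on top, which matches the 0.91 visible in the printed matrix.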

# center and standardize variables
boston_scaled <- scale(Boston)

# summaries of the scaled variables
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
# class of the boston_scaled object
class(boston_scaled)
## [1] "matrix"
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)

# summary of the scaled crime rate
summary(boston_scaled$crim)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.419367 -0.410563 -0.390280  0.000000  0.007389  9.924110
# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
# create a categorical variable 'crime', using the quantiles as the break points
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))

# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)

# number of rows in the Boston dataset 
n <- nrow(boston_scaled)

# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# create train set
train <- boston_scaled[ind,]

# create test set 
test <- boston_scaled[-ind,]

# save the correct classes from test data
correct_classes <- test$crime

# remove the crime variable from test data
test <- dplyr::select(test, -crime)

Now the test data has been created. Next we are going to fit a linear discriminant analysis (LDA) on the train dataset. Notice that in this case we have four classes. The LDA algorithm starts by finding directions that maximize the separation between classes, then uses these directions to predict the class of individuals. These directions, called linear discriminants, are linear combinations of the predictor variables.

LDA assumes that predictors are normally distributed (Gaussian distribution) and that the different classes have class-specific means and equal variance/covariance.

LDA determines group means and computes, for each individual, the probability of belonging to the different groups. The individual is then assigned to the group with the highest probability score.

The lda() outputs contain the following elements:

Prior probabilities of groups: the proportion of training observations in each group.

Group means: the mean of each variable in each group.

Coefficients of linear discriminants: the linear combination of predictor variables that is used to form the LDA decision rule.

source: http://www.sthda.com/english/articles/36-classification-methods-essentials/146-discriminant-analysis-essentials-in-r/#linear-discriminant-analysis---lda
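The two ideas above (discriminants as linear combinations, and assignment to the most probable group) can be checked by hand. The following is a minimal sketch, fitted on the whole scaled Boston data rather than on a train split; the names `dat`, `fit`, `manual_scores` and `argmax_class` are only illustrative.

```r
library(MASS)  # ships with R; provides lda() and the Boston data

data("Boston")
scaled <- as.data.frame(scale(Boston))

# crime rate categorised at its quartiles, as in the diary
crime <- cut(scaled$crim, breaks = quantile(scaled$crim), include.lowest = TRUE,
             labels = c("low", "med_low", "med_high", "high"))
predictors <- scaled[, setdiff(names(scaled), "crim")]
dat <- data.frame(predictors, crime)

fit  <- lda(crime ~ ., data = dat)
pred <- predict(fit, newdata = dat)

# 1) discriminant scores are the centred predictors times the coefficient matrix
centred <- scale(as.matrix(predictors), center = colMeans(predictors), scale = FALSE)
manual_scores <- centred %*% fit$scaling
scores_match <- isTRUE(all.equal(manual_scores, pred$x, check.attributes = FALSE))

# 2) the predicted class is the group with the highest posterior probability
argmax_class <- colnames(pred$posterior)[apply(pred$posterior, 1, which.max)]
classes_match <- all(argmax_class == as.character(pred$class))

c(scores_match = scores_match, classes_match = classes_match)
```

Both checks rely on the default priors (the group proportions in the fitted data); with user-supplied priors the centring used by predict() would differ.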

# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2623762 0.2475248 0.2524752 0.2376238 
## 
## Group means:
##                   zn      indus        chas        nox         rm        age
## low       0.87289539 -0.8875424 -0.12375925 -0.8644425  0.4059990 -0.8527908
## med_low  -0.08912598 -0.3083338 -0.03610305 -0.5579682 -0.1405807 -0.2897063
## med_high -0.39307863  0.1869203  0.07506213  0.3960485  0.0498597  0.3952586
## high     -0.48724019  1.0172418 -0.10828322  1.0307452 -0.4272108  0.8011386
##                 dis        rad        tax     ptratio       black       lstat
## low       0.7894046 -0.6839198 -0.7393332 -0.44615947  0.37445768 -0.74319682
## med_low   0.3399591 -0.5454537 -0.4748340 -0.07368943  0.31092733 -0.11251314
## med_high -0.3527833 -0.3918744 -0.2865599 -0.17237416  0.03997162  0.09917451
## high     -0.8477376  1.6368728  1.5131579  0.77931510 -0.85865626  0.96558055
##                 medv
## low       0.50130887
## med_low  -0.01791937
## med_high  0.08565543
## high     -0.74304345
## 
## Coefficients of linear discriminants:
##                 LD1          LD2         LD3
## zn       0.08615375  0.729103354 -0.92685225
## indus    0.02608711 -0.309494320  0.32747034
## chas    -0.09721032 -0.068330031  0.14036976
## nox      0.39025299 -0.772616011 -1.48954905
## rm      -0.06899409 -0.096009440 -0.14202615
## age      0.26416318 -0.316599830  0.07108204
## dis     -0.03725367 -0.441591996  0.22902553
## rad      2.94587444  1.105911226 -0.09636998
## tax      0.02016319 -0.170647983  0.61633506
## ptratio  0.13987417 -0.104279706 -0.37149744
## black   -0.12945747  0.003882527  0.10479589
## lstat    0.24417812 -0.332310650  0.31022352
## medv     0.17058063 -0.548336885 -0.24349141
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9454 0.0417 0.0129
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)

The crime rate was divided into four classes at its quartiles, and the crime variable acts as the target variable. In the plot we see four clusters: three of them overlap and one is far away from the others. The arrows show which variables affect the classification most (rad, zn, nox), but because there are so many variables it is hard to make out the others.

# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       15       5        1    0
##   med_low    9      10        7    0
##   med_high   2       8       14    0
##   high       0       0        0   31
#Calculate accuracy percent of the model
correct_predicts <- 100 * mean(lda.pred$class==correct_classes)
correct_predicts <- round(correct_predicts, digits = 0)

#Print correct predicts percentage
print(correct_predicts)
## [1] 69

We split our data earlier so that we have the test set and the correct class labels. The performance of the prediction model on the test data is acceptable but not perfect (prediction accuracy is 69%). It predicts the high crime rate class perfectly but the lower rates less well.
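The accuracy can also be read directly off the cross-tabulation, together with a per-class hit rate. The sketch below simply retypes the table printed above (the numbers come from that one random split, so another run would give different counts); `conf_mat` and `per_class_recall` are illustrative names.

```r
# confusion matrix copied from the printed cross-tabulation
conf_mat <- matrix(
  c(15,  5,  1,  0,
     9, 10,  7,  0,
     2,  8, 14,  0,
     0,  0,  0, 31),
  nrow = 4, byrow = TRUE,
  dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                  predicted = c("low", "med_low", "med_high", "high"))
)

# overall accuracy: correctly classified (diagonal) over all test observations
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

# share of each true class that was predicted correctly
per_class_recall <- diag(conf_mat) / rowSums(conf_mat)

round(100 * accuracy)        # 69
round(per_class_recall, 2)
```

Here 70 of the 102 test observations fall on the diagonal, and the high class is the only one that is recovered completely.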

# load the data
data("Boston")

# Standardizing Boston dataset
scaled_boston <- scale(Boston)

# euclidean distance matrix
dist_eu <- dist(scaled_boston)

# look at the summary of the distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
# manhattan distance matrix
dist_man <- dist(scaled_boston, method = 'manhattan')

# look at the summary of the distances
summary(dist_man)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2662  8.4832 12.6090 13.5488 17.7568 48.8618
# k-means clustering
km <-kmeans(scaled_boston, centers = 3)

# plot the scaled_boston dataset with clusters
pairs(scaled_boston, col = km$cluster)

set.seed(123)

# determine the number of clusters
k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(scaled_boston, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

# k-means clustering
km <-kmeans(scaled_boston, centers = 3)

# plot the scaled_boston dataset with clusters
pairs(scaled_boston, col = km$cluster)

I tested several different numbers of clusters. Based on the visualization, 3 appears to be the optimal number of clusters, as it sits at the bend of the elbow (the point where the total WCSS drops radically and then levels off).
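Because k-means starts from random centres, the elbow curve can change a little from run to run. One possible way to make the reading steadier is to use several random starts per k and to look at how much the total WCSS drops at each step. This is a sketch under my own assumptions (20 starts, 50 iterations), not something the exercise required.

```r
library(MASS)  # ships with R; provides the Boston data

data("Boston")
scaled_boston <- scale(Boston)

set.seed(123)
k_max <- 10

# total within-cluster sum of squares for k = 1..k_max,
# keeping the best of 20 random starts for each k (assumed settings)
twcss <- sapply(1:k_max, function(k) {
  kmeans(scaled_boston, centers = k, nstart = 20, iter.max = 50)$tot.withinss
})

# decrease in total WCSS when going from k - 1 to k clusters
drops <- -diff(twcss)
names(drops) <- 2:k_max
round(drops)
```

A large drop followed by much smaller ones marks the candidate elbow; the choice of k still remains a judgment call.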

Bonus

# load the data
data("Boston")

# Standardizing Boston dataset
scaled_kmeans_boston <- scale(Boston)

scaled_kmeans_boston <- as.data.frame(scaled_kmeans_boston)

# k-means clustering
km <-kmeans(scaled_kmeans_boston, centers = 3)

lda_kmeans <- lda(km$cluster ~ ., data = scaled_kmeans_boston)
lda_kmeans
## Call:
## lda(km$cluster ~ ., data = scaled_kmeans_boston)
## 
## Prior probabilities of groups:
##         1         2         3 
## 0.2470356 0.3260870 0.4268775 
## 
## Group means:
##         crim         zn      indus         chas        nox         rm
## 1 -0.3989700  1.2614609 -0.9791535 -0.020354653 -0.8573235  1.0090468
## 2  0.7982270 -0.4872402  1.1186734  0.014005495  1.1351215 -0.4596725
## 3 -0.3788713 -0.3578148 -0.2879024  0.001080671 -0.3709704 -0.2328004
##           age        dis        rad        tax     ptratio      black
## 1 -0.96130713  0.9497716 -0.5867985 -0.6709807 -0.80239137  0.3552363
## 2  0.79930921 -0.8549214  1.2113527  1.2873657  0.59162230 -0.6363367
## 3 -0.05427143  0.1034286 -0.5857564 -0.5951053  0.01241316  0.2805140
##        lstat        medv
## 1 -0.9571271  1.06668290
## 2  0.8622388 -0.67953738
## 3 -0.1047617 -0.09820229
## 
## Coefficients of linear discriminants:
##                 LD1         LD2
## crim    -0.03206338 -0.19094456
## zn       0.02935900 -1.07677218
## indus    0.63347352 -0.09917524
## chas     0.02460719  0.10009606
## nox      1.11749317 -0.75995105
## rm      -0.18841682 -0.57360135
## age     -0.12983139  0.47226685
## dis      0.04493809 -0.34585958
## rad      0.67004295 -0.08584353
## tax      1.03992455 -0.58075025
## ptratio  0.25864960 -0.02605279
## black   -0.01657236  0.01975686
## lstat    0.17365575 -0.41704235
## medv    -0.06819126 -0.79098605
## 
## Proportion of trace:
##    LD1    LD2 
## 0.8506 0.1494
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# k-means cluster memberships are the target classes of this model
clusters <- km$cluster

# plot the lda results
plot(lda_kmeans, dimen = 2, col = clusters, pch = clusters)
lda.arrows(lda_kmeans, myscale = 4)

In the plot we see two overlapping clusters and one cluster that lies away from the others. The arrows tell us that nox, zn, tax and medv are the most influential variables in the model.

Super Bonus

model_predictors <- dplyr::select(train, -crime)

# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)

library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train$crime)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = classes)